NodeFlow: Towards End-to-End Flexible Probabilistic Regression on Tabular Data

We introduce NodeFlow, a flexible framework for probabilistic regression on tabular data that combines Neural Oblivious Decision Ensembles (NODEs) and Conditional Continuous Normalizing Flows (CNFs). It offers improved modeling capabilities for arbitrary probabilistic distributions, addressing the limitations of traditional parametric approaches. In NodeFlow, the NODE captures complex relationships in tabular data through a tree-like structure, while the conditional CNF uses the NODE's output space as a conditioning factor. NodeFlow is trained with standard gradient-based learning, which allows end-to-end optimization of the NODE and the CNF-based density estimation. This approach provides strong performance, ease of implementation, and scalability. Comprehensive assessments on benchmark datasets show that NodeFlow achieves state-of-the-art results in multivariate probabilistic regression setups and strong performance in univariate regression tasks. Furthermore, ablation studies are conducted to justify the design choices of NodeFlow. In conclusion, NodeFlow's end-to-end training process and strong performance make it a compelling solution for practitioners and researchers, and it opens new avenues for research and application in the field of probabilistic regression on tabular data.


Introduction
Tabular regression involves predicting a continuous target variable based on structured data arranged in a tabular format. It is a vital task in machine learning with applications in various domains, including finance, healthcare, and marketing. In these domains, decisions can have serious consequences, so they require not only accurate predictions but also robust uncertainty quantification. These properties can be obtained with probabilistic methods that go beyond point estimation by modeling the entire conditional distribution. This approach offers several advantages, including the ability to quantify uncertainty, capture complex data distributions, and provide a more comprehensive understanding of the data.
Regarding deterministic tabular regression, there have been two distinct paths of research, with no clear conclusion as to which approach is best [1,2]. The first path focuses on gradient-boosted trees, exemplified by popular approaches such as XGBoost [3], CatBoost [4], and LightGBM [5]. These methods have demonstrated remarkable performance in point estimation tasks, leveraging ensemble techniques to capture complex relationships in the data. The second research path explores deep learning techniques for regression on tabular data with models such as NODE [6], TabNet [7], or FT-Transformer [8]. These methods, with their ability to capture intricate patterns and relationships, have shown promise in surpassing the performance of gradient-boosted trees. They offer flexibility in handling various data types, including

• We introduce NodeFlow, to the best of our knowledge the first framework to apply an end-to-end, tree-structured deep learning model to probabilistic regression on tabular data;
• We demonstrate NodeFlow's superior performance in multivariate probabilistic regression and competitive results in univariate tasks on benchmark datasets, establishing its effectiveness;
• We conduct a focused ablation study, hyperparameter sensitivity analysis, and computational efficiency assessment, validating NodeFlow's design and scalability.

Literature Review
Tree-Based Regression on Tabular Data
Standard tree-based regression approaches, including XGBoost [3], CatBoost [4], and LightGBM [5], have emerged as state-of-the-art methods for modeling tabular data in regression problems. These frameworks leverage ensemble techniques and advanced optimizations to achieve remarkable performance in various domains. XGBoost is an optimized gradient-boosting framework that combines decision trees to capture complex relationships in tabular data. CatBoost incorporates novel techniques to handle categorical features effectively, while LightGBM utilizes tree-based learning algorithms and efficient data processing strategies. Their widespread adoption and success in diverse applications highlight their effectiveness and prominence in the field of tabular regression modeling, enabling accurate point estimation and capturing intricate patterns within the data.

Tree-Based Probabilistic Regression on Tabular Data
In recent years, several approaches have been developed for probabilistic regression on tabular data, including NGBoost [9], CatBoost with univariate Gaussian support [11], and the Probabilistic Gradient Boosting Machine (PGBM) [10], each offering unique methods to model probabilistic distributions and improve regression performance. NGBoost is a versatile algorithm that can model various probabilistic distributions using a defined probability density function. It estimates distribution parameters by optimizing scoring rules such as the negative log-likelihood (NLL) or Continuous Ranked Probability Score (CRPS).
RoNGBa [13] is an NGBoost extension that enhances performance through improved hyperparameter selection. CatBoost, a gradient-boosting framework, has also been adapted to probabilistic regression but supports only univariate Gaussian distributions. PGBM treats leaf weights as random variables and can model different posterior distributions, albeit limited to location and scale parameters.

Deep Learning Regression on Tabular Data
In recent years, deep neural networks have achieved remarkable success in handling unstructured data, but their effectiveness in dealing with tabular data remains inconclusive. Several research papers, including [6-8,14,15], have introduced new deep learning regression methods that demonstrate superiority over tree-based methods. However, recent surveys have produced conflicting results on this topic. Notably, Borisov et al. [1] conducted a study comparing deep models to traditional machine learning methods on selected datasets. They found that deep models consistently outperformed traditional methods, but no single deep model universally outperformed all others. These findings highlight the nuanced performance of deep learning models on tabular data. Additionally, recent benchmarks conducted by Grinsztajn et al. [2] compared tree-based models and deep learning methods, specifically on tabular data. The benchmarks revealed that tree-based models such as XGBoost and random forests remain state-of-the-art for medium-sized datasets (with fewer than 10,000 samples). Notably, even without considering their superior processing speed, tree-based models maintained a competitive edge over deep learning approaches.
Neural Oblivious Decision Ensembles (NODEs), introduced by [6], are a deep learning architecture that extends ensembles of oblivious decision trees. The architecture combines end-to-end gradient-based optimization with multi-layer hierarchical representation learning. DNF-Net, proposed by [7], is a neural architecture incorporating a disjunctive normal form (DNF) structure, allowing efficient and interpretable feature selection. It promotes localized decisions over small feature subsets, enhancing interpretability and mitigating overfitting. TabNet [14] is a deep learning architecture specifically tailored for tabular data. It processes raw tabular data without preprocessing, facilitating seamless integration into end-to-end learning. Sequential attention mechanisms identify crucial features at each decision step, enhancing interpretability and learning efficiency. TabNet also provides interpretable feature attributions and insights into the model's global behavior. Gorishniy et al. [8] proposed FT-Transformer, a modified version of the Transformer architecture designed for tabular data. FT-Transformer incorporates both categorical and continuous features, employs self-attention mechanisms to capture feature relationships, and integrates residual connections akin to ResNet. In addition to these approaches, SAINT (Self-Attention and Intersample Attention Transformer) [15] is a hybrid deep learning approach designed to solve tabular data problems. SAINT integrates attention over both rows and columns, an enhanced embedding method, and a contrastive self-supervised pre-training technique.

Deep Learning Probabilistic Regression on Tabular Data
Recently, there has been limited research on probabilistic deep learning for tabular data. One notable method in this area is Deep Ensemble [16], which involves training an ensemble of neural networks using negative log-likelihood optimization with a Gaussian distribution as the modeling choice. The authors also incorporate adversarial training to produce smoother predictive estimates. Another approach, MC-Dropout [17], extends the use of dropout to capture model uncertainty during inference. By sampling multiple dropout masks during inference and averaging the predictions over these masks, an ensemble of models is created to capture model uncertainty collectively. Probabilistic Backpropagation [18] treats the neural network weights as random variables and approximates their posterior distribution using a factorized Gaussian distribution. This approximation is updated iteratively utilizing a combination of variational inference and stochastic gradient descent. More recently, TreeFlow [12] introduced a tree-based approach that combined the advantages of tree ensembles with the flexibility of modeling probability distributions using normalizing flows. By using a tree-based model as a feature extractor and combining it with a conditional variant of normalizing flow, TreeFlow enabled the modeling of complex distributions in regression outputs. While TreeFlow has shown superior performance in some cases, its lack of end-to-end training may result in suboptimal results.
In conclusion, the existing methods for probabilistic regression on tabular data often have limitations in terms of their modeling flexibility or end-to-end training. NodeFlow addresses these limitations by combining the tree-based NODE with the flexibility of CNFs, offering end-to-end training and a unique solution for probabilistic regression on tabular data.

NodeFlow
The architecture of NodeFlow is provided in Figure 1. The real-valued input vector $x$ of dimensionality $D$ is initially processed using a Neural Oblivious Decision Ensemble, consisting of NODE layers (details of the layer are depicted in Figure 2) arranged in a multilayer hierarchical structure. This allows the extraction of a rich hierarchical representation $w$. We use that vector as a conditioning factor for the conditional Continuous Normalizing Flow (CNF) in the next step. This component is responsible for the flexible modeling of the conditional probabilistic distribution of the vector $y$. There are no restrictions on the dimensionality of the response vector, so both uni- and multivariate regression problems are covered. The whole architecture is trained in an end-to-end fashion using gradient-based optimization.

Extracting Hierarchical Representation with NODE
In order to extract a rich hierarchical representation for a given input $x$, we utilize a Neural Oblivious Decision Ensemble (NODE) $h_\phi(x)$ parametrized by $\phi$, which is a machine learning architecture that combines differentiable oblivious decision trees (ODTs) $f(x)$. In this section, we start by introducing the ODTs. Then, we discuss the composition of the ODTs into the NODE layer, and finally, we present the NODE component responsible for the hierarchical representation extraction in NodeFlow.
A single differentiable oblivious decision tree $f(x)$ of depth $d$ is defined as:
$$f(x) = \sum_{i=1}^{2^d} r_i \, l_i(x), \tag{1}$$
where $r = [r_1, \ldots, r_{2^d}]$ is a $2^d$-dimensional vector of real-valued trainable responses for each of the considered leaves in the tree, and $l(x) = [l_1(x), \ldots, l_{2^d}(x)]$ is a $2^d$-dimensional vector of real-valued entries from the range $[0, 1]$. The vector $l(x)$ is called a "choice vector", and its $i$-th entry corresponds to the probability of the sample ending up in the $i$-th leaf.
To compute the choice vector, the probabilities of taking the left or the right path are multiplied across successive depth levels of the tree. Note that in an oblivious decision tree, only one decision is made at each depth level; the decision at depth $i$ is denoted $c_i(x)$. The final choice vector $l(x)$ is derived using the formula:
$$l(x) = c_1(x) \otimes c_2(x) \otimes \cdots \otimes c_d(x), \tag{2}$$
where $\otimes$ denotes the Kronecker product and each $c_i(x)$ is the two-element vector of branch probabilities at depth $i$.
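The construction of the choice vector and the tree response can be illustrated with a minimal sketch in plain Python. The per-level probabilities and leaf responses below are hypothetical values chosen for illustration, and the ordering of the two branches within each level is an assumption rather than something fixed by the text.

```python
from functools import reduce


def kron(a, b):
    """Kronecker product of two 1-D vectors given as lists."""
    return [ai * bj for ai in a for bj in b]


def choice_vector(left_probs):
    """Leaf probabilities l(x) for an oblivious tree of depth d.

    left_probs[i] is the probability of taking the first branch at depth
    i + 1; each level contributes the two-element vector [p, 1 - p].
    """
    levels = [[p, 1.0 - p] for p in left_probs]
    return reduce(kron, levels)


def tree_output(left_probs, responses):
    """Tree response f(x) = sum_i r_i * l_i(x)."""
    leaves = choice_vector(left_probs)
    if len(leaves) != len(responses):
        raise ValueError("expected 2**d responses")
    return sum(r * l for r, l in zip(responses, leaves))


left_probs = [0.9, 0.2]             # hypothetical branch probabilities, d = 2
responses = [1.0, 2.0, 3.0, 4.0]    # hypothetical leaf responses r
leaves = choice_vector(left_probs)  # four leaf probabilities that sum to one
value = tree_output(left_probs, responses)
```

Because every level contributes a vector that sums to one, the Kronecker product is itself a probability vector over the $2^d$ leaves, and the tree output is the expected leaf response under that distribution.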
To ensure differentiability of the tree splits during training, we use the $\alpha$-entmax function [19], which generalizes the Softmax ($\alpha = 1$) and Sparsemax ($\alpha = 2$) functions and allows sparse choices to be learned with gradient-based methods. The feature choice function $c_i(x)$ is then calculated as a two-class entmax function over the transformed output of the feature selection function $k_i(x)$. This can be expressed formally as:
$$c_i(x) = \operatorname{entmax}_\alpha\!\left(\left[\frac{k_i(x) - b_i}{\tau_i},\; 0\right]\right), \tag{3}$$
where $b_i$ and $\tau_i$ are learnable threshold and scale parameters, and $\alpha$ is the entmax hyperparameter that controls the level of sparsity in the output. In addition, the differentiable feature selection function can be written as:
$$k_i(x) = \sum_{j=1}^{D} p^{(i)}_j \, x_j, \tag{4}$$
where $p^{(i)}$ is the $D$-dimensional vector of feature selection weights given by $p^{(i)} = \operatorname{entmax}_\alpha(F_{i,:})$, with $F_{i,:}$ denoting the $i$-th row of $F$. Here, $F \in \mathbb{R}^{d \times D}$ is the feature selection matrix, a real-valued, learnable matrix. In summary, the differentiable oblivious decision tree $f$ is parameterized by the response vector $r$, the thresholds $b$, the scales $\tau$, and the feature selection matrix $F$, which makes gradient-based learning possible.
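To make the role of $\alpha$ concrete, the following sketch implements only the two special cases named in the text, Softmax ($\alpha = 1$) and Sparsemax ($\alpha = 2$), and uses them for a single tree level. It is an illustration under stated assumptions, not the implementation used in the experiments: the general $\alpha$-entmax is not implemented, the function names are hypothetical, and the input values are made up.

```python
import math


def softmax(z):
    """alpha = 1 case of entmax: dense probabilities."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sparsemax(z):
    """alpha = 2 case of entmax: Euclidean projection onto the simplex."""
    zs = sorted(z, reverse=True)
    cumsum, k, k_sum = 0.0, 0, 0.0
    for j, v in enumerate(zs, start=1):
        cumsum += v
        if 1.0 + j * v > cumsum:    # entry j is still inside the support
            k, k_sum = j, cumsum
    threshold = (k_sum - 1.0) / k
    return [max(v - threshold, 0.0) for v in z]


def level_choice(x, f_row, b, tau, normalize=sparsemax):
    """Two-class choice c_i(x) for one tree level.

    f_row : row i of the feature selection matrix F (length D)
    b, tau: threshold and scale of this level
    """
    p = normalize(f_row)                       # feature selection weights p^(i)
    k = sum(pj * xj for pj, xj in zip(p, x))   # selected feature value k_i(x)
    return normalize([(k - b) / tau, 0.0])     # two branch probabilities


x = [0.3, 0.8, 0.5]        # hypothetical input with D = 3
f_row = [2.0, 0.0, -1.0]   # hypothetical row of F
sparse_choice = level_choice(x, f_row, b=0.5, tau=0.1)
soft_choice = level_choice(x, f_row, b=0.5, tau=0.1, normalize=softmax)
```

With Sparsemax, the feature weights collapse onto a single feature and the branch decision becomes hard, whereas Softmax keeps every feature and both branches active with nonzero probability.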
To form the Neural Oblivious Decision Ensemble layer $F_l$ (depicted in Figure 2), we concatenate the outputs of the $T$ individual ODTs $f_1, \ldots, f_T$ that make up the layer. The output of the layer can be written as
$$F_l(x) = [f_1(x), \ldots, f_T(x)]. \tag{5}$$
Finally, the NODE architecture $h_\phi(x)$ is composed of $L$ stacked NODE layers in a similar fashion to the DenseNet model: each layer takes the input together with the concatenated outputs of all previous layers, allowing the model to learn both low-level and high-level features. This can be written as:
$$w_l = F_l([x, w_1, \ldots, w_{l-1}]), \quad l = 1, \ldots, L. \tag{6}$$
The outputs of all layers are concatenated to create the final representation extracted using NODE, $w = [w_1, \ldots, w_L] = h_\phi(x)$. The representation $w$ is then passed to the CNF as a conditioning factor.
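The dense connectivity between layers can be sketched as follows. The layers here are hypothetical probes that simply report the width of the input they receive, which makes the DenseNet-style growth of the layer inputs visible; a real NODE layer would instead return the outputs of its $T$ trees.

```python
def node_forward(x, layers):
    """Densely connected stack: layer l sees [x, w_1, ..., w_{l-1}].

    `layers` is a list of callables, each mapping a list of floats to the
    list of T outputs of that layer. Returns w = [w_1, ..., w_L].
    """
    layer_input = list(x)
    outputs = []
    for layer in layers:
        w_l = layer(layer_input)
        outputs.extend(w_l)
        layer_input = layer_input + w_l   # later layers also see this output
    return outputs


def make_probe_layer(num_trees):
    """Hypothetical stand-in for a NODE layer: reports its input width."""
    def layer(v):
        return [float(len(v))] * num_trees
    return layer


# D = 3 input features, L = 2 layers with T = 4 outputs each.
w = node_forward([0.1, 0.2, 0.3], [make_probe_layer(4), make_probe_layer(4)])
```

In this example the first layer receives 3 values and the second receives 3 + 4 = 7, while the final representation has $L \cdot T = 8$ entries.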

Probabilistic Modeling with CNFs
We consider the conditional variant of CNFs provided in [20,21], where the conditioning factor $w = h_\phi(x)$ is passed to the function describing the dynamics of $z(t)$, $g_\beta(z(t), t, w)$, parametrized by $\beta$. In the CNF setting, we aim to find a solution $y := z(t_1)$ of the differential equation $\frac{\partial z(t)}{\partial t} = g_\beta(z(t), t, w)$, given the initial state $z := z(t_0)$ with a known prior. Here, $z(t_0)$ is a random variable drawn from the base distribution, and $z(t_1)$ constitutes our observable data. Moreover, $t_0$ and $t_1$ denote the start and end points, respectively, of the continuous transformation process. The transformation between $z$ and $y$ is represented as:
$$y = u_{\beta,\phi}(z, x) = z + \int_{t_0}^{t_1} g_\beta(z(t), t, h_\phi(x)) \, dt. \tag{7}$$
The inverse form of the transformation $u_{\beta,\phi}(\cdot)$ is given by:
$$u^{-1}_{\beta,\phi}(y, x) = y - \int_{t_0}^{t_1} g_\beta(z(t), t, h_\phi(x)) \, dt. \tag{8}$$
Finally, we can calculate the log-probability of the target variable $y$ given the vector of features $x$ by the following formula:
$$\log p(y \mid x) = \log p\big(u^{-1}_{\beta,\phi}(y, x)\big) - \int_{t_0}^{t_1} \operatorname{Tr}\!\left(\frac{\partial g_\beta}{\partial z(t)}\right) dt, \tag{9}$$
which can be solved analogously to FFJORD [22] by employing the adjoint method to backpropagate through the solution of the neural ODE.
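As a sanity check of the log-probability formula, the sketch below evaluates it for a deliberately simple one-dimensional case. It assumes hypothetical linear dynamics $g(z, t, w) = w z$ with a scalar conditioning value $w$ and a standard normal base distribution, and it integrates the state and the trace term with a hand-written Runge-Kutta solver instead of the adjoint-based neural ODE solver referenced in the text.

```python
import math


def rk4_step(f, state, t, h):
    """One classical Runge-Kutta step for a list-valued state."""
    k1 = f(state, t)
    k2 = f([s + 0.5 * h * k for s, k in zip(state, k1)], t + 0.5 * h)
    k3 = f([s + 0.5 * h * k for s, k in zip(state, k2)], t + 0.5 * h)
    k4 = f([s + h * k for s, k in zip(state, k3)], t + h)
    return [s + h / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]


def dynamics(z, t, w):
    """Hypothetical 1-D dynamics g(z, t, w) = w * z."""
    return w * z


def trace_jacobian(z, t, w):
    """d g / d z for the dynamics above (the Jacobian trace in 1-D)."""
    return w


def log_prob(y, w, t0=0.0, t1=1.0, steps=100):
    """log p(y | w) via the instantaneous change-of-variables formula."""
    def augmented(state, t):
        z, _ = state
        return [dynamics(z, t, w), trace_jacobian(z, t, w)]

    h = (t0 - t1) / steps            # negative step: integrate from t1 to t0
    state, t = [y, 0.0], t1
    for _ in range(steps):
        state = rk4_step(augmented, state, t, h)
        t += h
    z0, neg_trace_integral = state   # second entry is minus the trace integral
    log_base = -0.5 * (z0 * z0 + math.log(2.0 * math.pi))
    return log_base + neg_trace_integral
```

For these linear dynamics the flow maps a standard normal to a Gaussian with standard deviation $e^{w (t_1 - t_0)}$, so the numerical value can be compared against the closed-form Gaussian log-density.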

Training NodeFlow
Using formula (9), which directly defines the log-probability, we can train NodeFlow by directly optimizing the negative log-likelihood function. Let us assume we are given a dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n = (x_n^1, \ldots, x_n^D)$ represents a $D$-dimensional random feature vector, and $y_n = (y_n^1, \ldots, y_n^P)$ is the $P$-dimensional vector of targets. The training of the probabilistic model involves minimizing the conditional negative log-likelihood function (NLL), defined as:
$$\mathrm{NLL}(\beta, \phi) = -\sum_{n=1}^{N} \log p(y_n \mid x_n), \tag{10}$$
where each term $\log p(y_n \mid x_n)$ is given by formula (9). The goal during the training process is to find the optimal parameters $\beta^*$ and $\phi^*$ such that:
$$\beta^*, \phi^* = \arg\min_{\beta, \phi} \mathrm{NLL}(\beta, \phi). \tag{11}$$
All model parameters $\beta$, $\phi$ are trained end to end by optimizing the above-mentioned NLL using the standard gradient-based approach. Such an approach simplifies the modeling process by allowing the entire model to be trained using a single optimization algorithm. Moreover, the model can automatically learn relevant hierarchical representations of the data directly from the raw input data, capturing both low-level and high-level features. This eliminates the need for manual feature engineering, which can be time-consuming and require domain expertise.
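The idea of fitting all parameters jointly by minimizing the NLL can be illustrated on a toy model. The sketch below is a stand-in, not NodeFlow: the NODE is replaced by a single hypothetical weight `phi`, the CNF by a single affine map with log-scale `beta`, the data are synthetic, and the gradients are written by hand. The mean NLL is used instead of the sum, which does not change the minimizer.

```python
import math
import random


def nll_and_grads(data, phi, beta):
    """Mean NLL of y | x ~ N(phi * x, exp(beta)^2) and its gradients."""
    nll = g_phi = g_beta = 0.0
    for x, y in data:
        r = (y - phi * x) * math.exp(-beta)   # standardized residual
        nll += 0.5 * r * r + beta + 0.5 * math.log(2.0 * math.pi)
        g_phi += -r * x * math.exp(-beta)     # d NLL / d phi
        g_beta += 1.0 - r * r                 # d NLL / d beta
    n = len(data)
    return nll / n, g_phi / n, g_beta / n


def fit(data, lr=0.1, steps=500):
    """Plain gradient descent on both parameters at once."""
    phi, beta = 0.0, 0.0
    history = []
    for _ in range(steps):
        nll, g_phi, g_beta = nll_and_grads(data, phi, beta)
        history.append(nll)
        phi -= lr * g_phi
        beta -= lr * g_beta
    return phi, beta, history


rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))   # synthetic targets
phi, beta, history = fit(data)
```

Both the feature-side parameter and the density-side parameter receive gradients from the same objective, which is the property that end-to-end training relies on; here the fit should recover a slope near 2 and a noise scale near 0.5.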

Experiments
In this section, we present a comprehensive set of experiments to evaluate the performance and effectiveness of NodeFlow in the context of tabular regression problems. We aimed to assess NodeFlow's capabilities in capturing complex data distributions, generating accurate point estimates, and quantifying uncertainty. To achieve this, we conducted evaluations on univariate and multivariate benchmark datasets, comparing NodeFlow with other reference methods. We measured the performance using various evaluation metrics such as the negative log-likelihood (NLL), Continuous Ranked Probability Score (CRPS), and Root-Mean-Square Error (RMSE). Through these experiments, we aimed to demonstrate the performance and flexibility of NodeFlow in probabilistic regression tasks, contributing to the advancement of the field and providing insights for practical applications.
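For models without a closed-form predictive distribution, CRPS is commonly estimated from samples. The sketch below uses the standard sample-based estimator $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ together with a plain RMSE; whether the experiments used exactly this estimator is an assumption, and the function names are illustrative.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    spread_to_target = sum(abs(s - y) for s in samples) / n
    spread_within = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return spread_to_target - 0.5 * spread_within


def rmse(predictions, targets):
    """Root-mean-square error between point predictions and targets."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return (sum(squared) / len(squared)) ** 0.5
```

When all samples coincide, the second term vanishes and CRPS reduces to the absolute error of that single point prediction, which is why CRPS is often read as a probabilistic generalization of the mean absolute error.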

Methodology
In our evaluation, we adhered to the established probabilistic regression benchmark, as delineated in previous studies [9,11,12], excluding the Boston dataset in consideration of ethical concerns [23]. For univariate regression, we employed nine datasets from the UCI Machine Learning Repository and six datasets for multivariate regression as suggested by [12], with comprehensive dataset details provided in Appendix A. In alignment with protocols from the referenced literature, we generated 20 random folds for the univariate regression datasets (with the exception of Protein at five folds and Year MSD at a single fold), designating 10% of the data for testing in each fold. The remainder was divided into an 80%/20% training/validation split for epoch selection. Our results are presented as the mean and standard deviation across validation folds. We benchmarked NodeFlow against a suite of models, including four tree-based probabilistic models (NGBoost, RoNGBa, CatBoost, PGBM), a deep learning approach (Deep Ensemble), and a hybrid model (TreeFlow) for univariate tasks. For multivariate regression challenges, we adopted training/testing splits as per the referenced protocols, comparing NodeFlow against NGBoost variants and TreeFlow. The architecture specifics and hyperparameter tuning methodology for NodeFlow are detailed in Appendix B.
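The fold protocol described above can be sketched as follows. This is an illustration of the stated proportions only; the function name, the seed, and the use of Python's `random` module are assumptions, and the benchmark's actual fold indices are not reproduced.

```python
import random


def make_folds(n_samples, n_folds=20, test_frac=0.10, val_frac=0.20, seed=0):
    """Random folds: hold out test_frac, then split the rest into train/val."""
    rng = random.Random(seed)
    folds = []
    for _ in range(n_folds):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        n_test = round(test_frac * n_samples)
        test, rest = idx[:n_test], idx[n_test:]
        n_val = round(val_frac * len(rest))
        val, train = rest[:n_val], rest[n_val:]
        folds.append({"train": train, "val": val, "test": test})
    return folds


folds = make_folds(1000)   # 20 folds of 720 / 180 / 100 indices each
```

Each fold reshuffles the full dataset, so test sets of different folds may overlap; within one fold the three index sets are disjoint.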

Probabilistic Regression Framework
This segment evaluates NodeFlow's performance within a probabilistic framework, analyzing its negative log-likelihood (NLL) scores against benchmark datasets for both univariate and multivariate regression tasks previously outlined.
In Table 1, we present the evaluation results for the univariate regression task, where NodeFlow exhibited competitive performance across a range of datasets, frequently achieving the best or second-best NLL scores. Notably, NodeFlow excelled on the Year MSD dataset and secured second-best results on the Wine, Protein, Power, and Kin8nm datasets. Our analysis extended to a detailed comparison of NodeFlow against various methodological approaches, including deep learning-based methods, tree-based ensemble methods, and the hybrid method TreeFlow. Against the Deep Ensemble, NodeFlow consistently demonstrated superior or at least equivalent performance, with particularly noteworthy achievements on the Energy, Power, Protein, Wine, and Yacht datasets. This is especially significant for the Protein and Wine datasets, which are characterized by their underlying multimodal target distributions, a scenario where NodeFlow's flexible distribution modeling was especially advantageous (refer to [12] for details). When compared to tree-based methods such as CatBoost, NGBoost, RoNGBa, and PGBM, NodeFlow maintained a competitive edge, often outperforming or matching the best results, underscoring its ability to model complex data relationships within tabular datasets. In direct comparison, NodeFlow and TreeFlow exhibited closely matched performance, with each method surpassing the other on different datasets. This comparative analysis highlights both NodeFlow's efficacy across a broad spectrum of univariate regression problems and its capacity to address the intricacies of tabular data modeling.
In Table 2, we detail NodeFlow's performance across multivariate probabilistic regression tasks, where it outperformed competing approaches on five of the six datasets examined. Compared with TreeFlow, NodeFlow's superiority was particularly evident in datasets with multiple target dimensions, such as scm20d (16 target dimensions) and Energy (17 target dimensions). For two-dimensional target datasets like Parkinsons and US Flight, NodeFlow continued to outperform, albeit with a narrower margin. The distinction became more nuanced with one-dimensional targets, as presented in the univariate analysis, where NodeFlow and TreeFlow showed comparable results. This differentiation underscores the strength of NodeFlow's end-to-end learning, which excels in complex, high-dimensional settings by providing finely tuned representations. Such end-to-end learning is absent in TreeFlow, limiting its effectiveness in comparison. This evidence supports the value of end-to-end learning for achieving strong performance, particularly in multivariate regression problems.

Point-Prediction Regression Setup
This section assesses the effectiveness of our method in a point-prediction context by comparing its Root-Mean-Square Error (RMSE) scores on the univariate regression datasets. To calculate the RMSE results for the TreeFlow and NodeFlow methods, we used the RMSE@K metric introduced in [12], where K = 2. This metric is suitable for uni- and multivariate regression problems with multiple-point predictions. We present the results in Table 3. Our method achieved the best results on two datasets and ranked second on two others. For the remaining datasets, it remained competitive with benchmark methods. Notably, these results are commendable, considering our approach is designed for probabilistic setups. Providing point estimates, particularly from multimodal distributions, presents unique challenges compared to simply taking the mean of parametric distributions like Gaussian. This context underscores the strength of our method's performance across various datasets.

Summary
In summary, our evaluation of NodeFlow across both probabilistic and point-prediction scenarios demonstrates its efficacy. While NodeFlow's performance on tasks with one-dimensional targets aligns with existing benchmarks, it distinctly excels in handling problems with two or more target dimensions. The results indicate that the greater the dimensionality of the target variable, the more pronounced NodeFlow's advantage becomes. We attribute this performance to NodeFlow's flexible probabilistic modeling and end-to-end learning approach, which yield representations tailored to complex problems. Consequently, NodeFlow is well suited to probabilistic regression tasks involving high-dimensional targets.

Ablation Studies
To better understand the NodeFlow method, we carried out a series of ablation studies examining the impact of its critical design choices. Specifically, this investigation focused on two integral components: the feature representation component, realized in NodeFlow by NODEs, and the probabilistic modeling component, realized by CNFs. We evaluated our methods using both probabilistic and point-prediction frameworks. Additionally, we conducted a qualitative analysis of the learned representations and estimated probability density functions. Moreover, the results of the computational time comparison are included in Section 6.

Feature Representation Component
In our ablation study, we assessed the critical role of the Neural Oblivious Decision Ensemble (NODE) component in enhancing feature extraction within our proposed framework, NodeFlow. To this end, we conducted both quantitative and qualitative analyses, employing two benchmarking variants for comparison: one with the NODE component removed, relying solely on min-max scaling (termed CNF), and another replacing the NODE with a shallow Multilayer Perceptron (MLP), labeled CNF + MLP.
Quantitative results, detailed in Table 4, evaluate the performance across probabilistic and point-prediction metrics: negative log-likelihood (NLL), Continuous Ranked Probability Score (CRPS), and Root-Mean-Square Error at 2 (RMSE@2), presented as mean values alongside their standard deviations. The experimental setup was kept consistent with the main experiments.
Our findings reveal that NodeFlow, with the NODE component integrated, delivered the lowest NLL values on a majority of datasets, highlighting its data modeling capabilities. Additionally, NodeFlow surpassed the comparative approaches in CRPS, indicating enhanced precision in probabilistic forecasting. Furthermore, NodeFlow achieved the most favorable RMSE scores, underlining the NODE component's role in achieving precise point predictions.
In our qualitative analysis, we visualized feature representations derived from the models, utilizing dimensionality reduction via the UMAP algorithm [24] and color-coding each point according to its target variable. Figure 3 illustrates these representations for the Energy dataset. The leftmost visualization corresponds to the CNF model, which, lacking additional processing layers, essentially reflects the raw dataset rescaled to the (−1, 1) range. The middle image depicts the representation from the CNF + MLP model, while the rightmost image shows the outcome of employing a NODE within the NodeFlow method. Comparatively, the NodeFlow representation, produced by NODE processing, shows a clearly enhanced separation and disentanglement of observations, with distinct clusters forming around similar target values. This level of disentanglement, absent in the representations of the CNF variants, likely plays a crucial role in NodeFlow's superior performance across quantitative metrics. Collectively, these outcomes validate the NODE component's contribution to NodeFlow's architecture: it yields competitive or superior performance in the NLL, CRPS, and RMSE metrics, as well as more disentangled and more clearly separated representations than the alternatives examined.

Probabilistic Modeling Component
In this ablation study, we evaluated the effectiveness and fit of the probabilistic modeling component within our framework. Specifically, we substituted the CNF component with standard probabilistic distributions, labeling these variants as NodeGauss (using a Gaussian distribution) and NodeGMM (employing a mixture of Gaussians). This experimental design mirrors the setup of our previous ablation studies.
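The difference between the two parametric alternatives can be illustrated by comparing their log-densities on bimodal targets. The sketch below covers only the likelihood side: the targets and all distribution parameters are hand-set, hypothetical values rather than outputs of a trained NODE, and the function names are illustrative.

```python
import math


def gaussian_logpdf(y, mean, std):
    """Log-density of a univariate Gaussian."""
    return (-0.5 * ((y - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def gmm_logpdf(y, weights, means, stds):
    """Log-density of a Gaussian mixture, computed with log-sum-exp."""
    terms = [math.log(w) + gaussian_logpdf(y, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def mean_nll(ys, logpdf):
    """Average negative log-likelihood of targets under a log-density."""
    return -sum(logpdf(y) for y in ys) / len(ys)


ys = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]   # hypothetical bimodal targets
std = (sum(y * y for y in ys) / len(ys)) ** 0.5
gauss_nll = mean_nll(ys, lambda y: gaussian_logpdf(y, 0.0, std))
gmm_nll = mean_nll(ys, lambda y: gmm_logpdf(y, [0.5, 0.5], [-2.0, 2.0], [0.1, 0.1]))
```

A single Gaussian has to spread its mass over the gap between the two modes, so its NLL is higher than that of the two-component mixture; a mixture in turn requires the number of components to be fixed in advance, which is the restriction the CNF component avoids.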
The findings, detailed in Table 5, indicate that NodeFlow surpassed both NodeGauss and NodeGMM in the negative log-likelihood (NLL) on the majority of the datasets, with NodeGMM outperforming NodeFlow on only a single dataset. In terms of the Continuous Ranked Probability Score (CRPS), NodeFlow attained the lowest scores on all datasets, indicating a more accurate calibration of predictive uncertainty relative to the alternatives. Point-prediction results further showed NodeFlow to be the most effective approach. These outcomes underscore the benefit of integrating a versatile probabilistic modeling component, as evidenced by the enhanced performance across all evaluated metrics.
Figure 4 illustrates the probability density functions estimated by NodeFlow, NodeGauss, and NodeGMM for selected samples from the Wine Quality and Protein datasets. These datasets were chosen due to their complex distributions and the significant differences in results among the models. In the Wine Quality example, NodeFlow produced a distribution concentrated between values six and seven, lacking the distinct peak characteristic of Gaussian distributions. The Protein dataset example showcased NodeFlow's ability to model a bimodal distribution with significant probability mass between peaks and a heavy right tail. Notably, both NodeGauss and NodeGMM struggled to fully capture the complexity of these sample distributions. This observation underscores the need for more sophisticated distributional modeling, as provided by the conditional Continuous Normalizing Flow (CNF) component in NodeFlow. Overall, NodeFlow's advantage across diverse metrics and datasets, together with the supporting visualizations, validates the role of the CNF component in its architecture.

Computational Time Comparison
In this analysis, we evaluated the training duration of NodeFlow relative to benchmark models from ablation studies, including CNF, CNF + MLP from the feature representation study, and NodeGauss and NodeGMM from the probabilistic modeling investigation. Our objective was to elucidate the computational demands of training each model across various datasets, as detailed in Table 6. The table delineates the mean training times and their standard deviations, offering insights into both average performance and variability. Overall, NodeFlow presents itself as a robust solution for probabilistic regression tasks on tabular data, adeptly balancing efficiency in training time with excellence in performance. This equilibrium makes NodeFlow a compelling option for both academic research and practical implementation, highlighting its potential as a preferred method in the domain.

Conclusions
In this study, we introduced NodeFlow, a novel framework for probabilistic regression on tabular data, leveraging Neural Oblivious Decision Ensembles (NODEs) and Conditional Continuous Normalizing Flows (CNFs). Our evaluations confirmed NodeFlow's exceptional capability in managing high-dimensional multivariate probabilistic regression tasks, effectively aligning with benchmarks for tasks with one-dimensional targets. Ablation studies elucidated the critical roles of the NODE and CNF components in NodeFlow's architecture, enhancing feature processing and complex distribution modeling, respectively. Moreover, NodeFlow emerges as a robust solution for advanced modeling and uncertainty quantification in regression tasks, adeptly balancing performance with computational efficiency. It not only establishes a significant presence in the domain of probabilistic regression but also lays a foundation for future advancements in machine learning interpretability and robustness. The differentiability of NodeFlow's architecture is particularly conducive to further research in interpretability techniques, including counterfactual explanations, feature attribution, and adversarial example generation, promising substantial contributions to the field's evolution.

Appendix B. Implementation Details
The research methodology adhered to the standard practices characteristic of machine learning projects.All models under consideration were implemented using Python 3.8, leveraging the deep learning library PyTorch.The training employed the usage of the PyTorch Lightning framework.We used the following infrastructure for the experiments: Intel(R) Xeon(R) Silver 4108 32-Core CPU, 4 NVIDIA GeForce GTX 1080 Ti GPUs, and 126 GB RAM.
We employed a Hyperband Pruner [25] as the hyperparameter search method. Hyperband is an efficient technique that concentrates the training budget on promising hyperparameter configurations and stops less promising ones early. To explore the hyperparameter space, we sampled parameters uniformly within the ranges specified in Table A2. Each dataset underwent a full search, with each fold limited to a maximum duration of three hours. This approach allowed us to tune the models efficiently and select the best-performing hyperparameters.
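To make the search procedure concrete, the following is a minimal plain-Python sketch of a Hyperband-style schedule with uniform sampling. It is not the implementation used in the experiments: the search space in `sample_config`, the synthetic `toy_validation_loss`, and the settings `max_resource=27` and `eta=3` are all hypothetical choices made for illustration, and the three-hour time limit per fold is omitted.

```python
import math
import random


def sample_config(rng):
    # Hypothetical search space; each value is drawn uniformly from its range.
    return {"log10_lr": rng.uniform(-5.0, -1.0), "depth": rng.randint(1, 6)}


def toy_validation_loss(config, resource):
    # Stand-in for "train for `resource` epochs, return the validation loss".
    # The loss shrinks with more resource and is lowest near log10_lr=-3, depth=4.
    gap = (config["log10_lr"] + 3.0) ** 2 + 0.1 * abs(config["depth"] - 4)
    return gap + 1.0 / resource


def hyperband(sample, evaluate, max_resource=27, eta=3, seed=0):
    rng = random.Random(seed)
    # Largest s with eta**s <= max_resource (integer loop avoids float log issues).
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1

    best_loss, best_config = math.inf, None
    for s in range(s_max, -1, -1):  # one bracket per value of s
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))
        configs = [sample(rng) for _ in range(n)]
        for i in range(s + 1):  # successive-halving rungs within the bracket
            resource = max_resource // eta ** (s - i)
            losses = [evaluate(c, resource) for c in configs]
            order = sorted(range(len(configs)), key=losses.__getitem__)
            if i == s:
                # Final rung: the survivors were evaluated at the full budget.
                if losses[order[0]] < best_loss:
                    best_loss, best_config = losses[order[0]], configs[order[0]]
            else:
                # Keep the best 1/eta fraction and give it eta times more budget.
                keep = max(1, len(configs) // eta)
                configs = [configs[j] for j in order[:keep]]
    return best_config, best_loss


best_config, best_loss = hyperband(sample_config, toy_validation_loss)
print(best_config, round(best_loss, 4))
```

Each bracket trades off the number of sampled configurations against the budget given to each one, so that poorly performing configurations are discarded after only a small fraction of the full training cost.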
Based on the results of the hyperparameter search, we analyzed the importance of each hyperparameter in the tuning process. To this end, we employed the fANOVA Hyperparameter Importance Evaluation algorithm [26], which fits a random forest regression model to predict the objective values of successfully completed trials from their parameter configurations. The outcomes of this analysis are illustrated in Figure A1.

Figure 3. Feature representations for the Energy dataset via UMAP for the ablation study. Left: CNF model, showing rescaled data within (−1, 1). Center: CNF + MLP model, indicating improved structuring. Right: NodeFlow with NODE, illustrating the superior hierarchical organization. Points are color-coded by the target variable.

Figure 4. Comparison of probability density functions estimated by NodeFlow, NodeGauss, and NodeGMM for selected samples from the Wine Quality and Protein datasets.

Figure A1. Hyperparameter importance analysis in the NodeFlow tuning process. Importance scores for each dataset and searched hyperparameter were calculated using the fANOVA Hyperparameter Importance Evaluation algorithm, with the highest scores underlining their pivotal role in the optimization process.

Table A2. Comprehensive overview of the hyperparameters employed in our research for optimizing the NodeFlow method. The hyperparameter ranges and settings for various datasets are detailed, allowing for a clear understanding of the tuning process.

Table 1. Benchmark for the univariate probabilistic regression problem with tabular data using negative log-likelihood (NLL) as the metric. The best results are marked by bold text, and the second best results are underlined.

Table 2. Benchmark for the multivariate probabilistic regression problem with tabular data using negative log-likelihood (NLL) as the metric. The best results are marked by bold text, and the second best results are underlined.

Table 3. Benchmark for the univariate point prediction regression problem with tabular data using Root-Mean-Square Error (RMSE). Note that for TreeFlow and NodeFlow, we used the RMSE@2 metric, which is more relevant. The best results are marked by bold text, and the second best results are underlined.

Table 4. Ablation study of the feature representation component in terms of negative log-likelihood (NLL), Continuous Ranked Probability Score (CRPS), and Root-Mean-Square Error at 2 (RMSE@2) metrics.

Table 5. Ablation study of the probabilistic modeling component in terms of negative log-likelihood (NLL), Continuous Ranked Probability Score (CRPS), and Root-Mean-Square Error at 2 (RMSE@2) metrics.

Table 6. Comparative analysis of training duration for NodeFlow and the ablation study approaches. In the feature representation study, the marginal difference in training times among NodeFlow, CNF, and CNF + MLP suggests that integrating the NODE component is cost-effective: it improves the model output without a corresponding surge in training duration. Conversely, the probabilistic modeling study shows a more pronounced disparity in training times, particularly between NodeFlow and the NodeGauss and NodeGMM variants, with NodeFlow achieving superior results at a proportional increase in computational time.

Table A1. An overview of the datasets employed in our study to assess the performance of NodeFlow. The table includes information on the number of data points (N), the number of cross-validation (CV) splits or observations in the test dataset, feature dimensionality (D), and target dimensionality (P).